Conference Proceedings

CC-News-En: A Large English News Corpus

Joel Mackenzie, Rodger Benham, Matthias Petri, Johanne R Trippas, J Shane Culpepper, Alistair Moffat

CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management | ACM | Published: 2020

Abstract

We describe a static, open-access news corpus using data from the Common Crawl Foundation, who provide free, publicly available web archives, including a continuous crawl of international news articles published in multiple languages. Our derived corpus, CC-News-En, contains 44 million English documents collected between September 2016 and March 2018. The collection is comparable in size with the number of documents typically found in a single shard of a large-scale, distributed search engine, and is four times larger than the news collections previously used in offline information retrieval experiments. To complement the corpus, 173 topics were curated using titles from Reddit threads, form..

Grants

Awarded by Australian Research Council


Funding Acknowledgements

We thank Sebastian Nagel (Common Crawl) for providing useful information about the news crawl, and the Common Crawl organization for their commitment to providing open data. This work was partially supported by Australian Research Council Grants DP170102231, DP190101113, and DP200103136. The second author was supported by an RMIT VCPS.